People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs. passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all, LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
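The core minimal-pair comparison can be reproduced in a few lines. Below is a minimal sketch (not the authors' released code), assuming a GPT-2-style causal language model from the Hugging Face transformers library: it sums token log-probabilities for each member of a pair and checks which sentence the model prefers.

```python
# Minimal sketch: score a minimal sentence pair with GPT-2 and check
# whether the plausible event gets the higher log-probability.
# Assumes the Hugging Face `transformers` library; gpt2 is one of many
# models one could plug in here.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Summed token log-probability the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over the predicted tokens; multiply back to get a summed log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

plausible = "The teacher bought the laptop."
implausible = "The laptop bought the teacher."
print(sentence_logprob(plausible) > sentence_logprob(implausible))  # expect True
```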
This paper describes the models developed by the AILAB-UDINE team for the SMM4H '22 shared task. We explored the limits of Transformer-based models for text classification, entity extraction, and entity normalization, tackling Tasks 1, 2, 5, 6, and 10. Our main takeaways concern the effects of combining different architectures through ensemble learning, and the great potential of generative models for term normalization.
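To make the ensemble idea concrete, here is a generic sketch (not the team's actual system; the checkpoint names are placeholders) that combines the predictions of several fine-tuned Transformer classifiers by majority vote.

```python
# Illustrative sketch of ensembling different architectures for text
# classification. The checkpoint names below are placeholders, not real
# models; substitute your own fine-tuned classifiers.
from collections import Counter
from transformers import pipeline

checkpoints = ["org/bert-finetuned", "org/roberta-finetuned", "org/xlnet-finetuned"]
classifiers = [pipeline("text-classification", model=c) for c in checkpoints]

def ensemble_predict(text: str) -> str:
    """Majority vote over the label predicted by each classifier."""
    votes = [clf(text)[0]["label"] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]
```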
Over the last decade, an increasing number of users have taken to reporting adverse drug events (ADEs) on social media platforms, blogs, and health forums. Given the sheer volume of these reports, pharmacovigilance has focused on methods that use natural language processing (NLP) to rapidly screen these large text collections for mentions of drug-related adverse reactions that warrant medical investigation. However, despite growing interest in the task and advances in NLP, the robustness of these models in the face of linguistic phenomena such as negation and speculation remains an open research question. Negation and speculation are pervasive in natural language and can severely hamper an automated system's ability to distinguish factual from non-factual statements in text. In this paper, we take four state-of-the-art systems for ADE detection on social media texts and introduce SNAX, a benchmark for testing their performance on samples containing negated and speculated ADEs, showing their fragility against these phenomena. We then introduce two possible strategies for increasing the robustness of these models, showing that both yield substantial performance gains, lowering the number of spurious entities predicted by the models by 60% for negation and 80% for speculation.
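The kind of fragility such a benchmark probes can be illustrated with a toy check. This is a sketch under assumptions: the checkpoint name is a placeholder, and the "ADE" label depends on the model's own tag set. It runs an ADE extraction model on an affirmative report and a negated rewrite; any entity fired on the negated version counts as spurious.

```python
# Sketch of a negation robustness probe for ADE extraction.
# "some-org/ade-ner-model" is a placeholder, not a real checkpoint, and
# the entity label "ADE" depends on the model's label scheme.
from transformers import pipeline

ner = pipeline("token-classification", model="some-org/ade-ner-model",
               aggregation_strategy="simple")

affirmative = "This drug gave me terrible headaches."
negated = "This drug did not give me any headaches."

for text in (affirmative, negated):
    entities = [e for e in ner(text) if e["entity_group"] == "ADE"]
    print(text, "->", [e["word"] for e in entities])
# A robust model should extract "headaches" only from the affirmative
# sentence; an entity predicted on the negated one is spurious.
```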